
REF: Check before calling DTA._from_sequence instead of catching #29589

Closed (6 commits)

Conversation

jbrockmendel
Member

Comment in core.apply is unrelated; it just doesn't merit its own PR.

Review thread on pandas/core/groupby/groupby.py, in _try_cast:

result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
result = result.astype(dtype)
except TypeError:
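
For context, the change named in the PR title replaces the try/except around DatetimeArray._from_sequence shown above with an explicit dtype check before the call. A minimal sketch of that pattern (editorial illustration, not the PR's actual diff), assuming result, arr, and dtype are the _try_cast locals from the snippet above:

# Editorial sketch only, not the PR's actual diff. Assumes `result`, `arr`,
# and `dtype` are the _try_cast locals from the snippet above.
from pandas.api.types import is_datetime64tz_dtype

if not is_datetime64tz_dtype(result.dtype):
    # result is not already tz-aware: localize to UTC first, then convert
    # to the column's original tz-aware dtype
    result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
    result = result.astype(dtype)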
Member

I suppose you can end up here with a different dtype that is not convertible to datetime64 (there is no guarantee that your groupby function or aggregation returns the same dtype as the original column). In such a case, we should not raise the error, but rather fall back to the original result as done now?

Member Author

AFAICT the relevant functions are all dtype-preserving (well, e.g. sometimes int can be cast to float, but that is not relevant for datetime64). Is there a case that I'm forgetting?

Member

What do you mean by "the relevant functions"? I think this gets used for user-defined functions as well?

Member Author

Most of the calls that get here go through our cython functions, but you're right that we can also get here from _python_agg_general. But in the case of user-defined functions that may not be dtype-preserving, we shouldn't be doing this casting at all, regardless of whether we pre-check vs. catch.

@jreback jreback added Clean Datetime Datetime data dtype labels Nov 13, 2019
@jreback jreback added this to the 1.0 milestone Nov 13, 2019
@jorisvandenbossche
Member

you're right that we can also get here from _python_agg_general. But in the case of user-defined functions that may not be dtype-preserving, we shouldn't be doing this casting at all, regardless of whether we pre-check vs. catch

Not fully sure if it should try this casting or not, but right now it does reach here, so your change has implications for that.

Consider this example:

In [5]: df = pd.DataFrame({'a': [0, 0, 1, 1], "b": pd.date_range("2012-01-01", periods=4, tz='UTC')}) 

In [6]: df.groupby('a')['b'].agg(lambda x: x.iloc[-1] - x.iloc[0])
Out[6]: 
a
0   1 days
1   1 days
Name: b, dtype: timedelta64[ns]

That will fail to convert to datetime, raise a TypeError and thus now fail with this branch.

Now, you are right that we maybe shouldn't try to cast (or at least only if the scalars are actual datetime objects), because if you write a function that returns integers, that now incorrectly casts back to datetime64:

In [3]: df.groupby('a')['b'].agg(lambda x: x.iloc[0].year)  
Out[3]: 
a
0   1970-01-01 00:00:00.000002012+00:00
1   1970-01-01 00:00:00.000002012+00:00
Name: b, dtype: datetime64[ns, UTC]
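
The odd-looking timestamps come from the integer 2012 being interpreted as 2012 nanoseconds after the Unix epoch once it is coerced to datetime64[ns]; the same effect can be reproduced outside groupby (illustration only):

import pandas as pd

# 2012 is read as 2012 nanoseconds after 1970-01-01 00:00:00 UTC,
# which matches the Out[3] values above.
pd.to_datetime([2012, 2012], unit="ns", utc=True)
# DatetimeIndex(['1970-01-01 00:00:00.000002012+00:00',
#                '1970-01-01 00:00:00.000002012+00:00'],
#               dtype='datetime64[ns, UTC]', freq=None)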

However, this case is a regression, as it worked correctly in pandas 0.23 (started to fail in 0.24, probably with the datetime arrays refactor).
(I seem to remember some discussion about this; it might be a reason for DatetimeArray._from_sequence to be stricter and only accept actual scalars of the array's type. But this is a digression here; I will open a new issue for it.)

@jorisvandenbossche
Member

About the "need" to cast here: e.g., this example would otherwise not have a tz-aware result, I think:

In [2]: df.groupby('a')['b'].agg(lambda x: x.iloc[0])                                                                                                                                         
Out[2]: 
a
0   2012-01-01 00:00:00+00:00
1   2012-01-03 00:00:00+00:00
Name: b, dtype: datetime64[ns, UTC]

@jbrockmendel
Member Author

@jorisvandenbossche first, I agree that this should be on hold until we sort out the desired when-to-try-casting behavior.

My read of your last two comments suggests that you agree with my previous comment that we should not be re-casting for UDFs?

@jreback
Contributor

jreback commented Nov 14, 2019

My read of your last two comments suggests that you agree with my previous comment that we should not be re-casting for UDFs?

I agree with this statement.

@jorisvandenbossche
Member

It would also be a regression to not cast (see the example in my last comment above; that would no longer output the correct dtype if there was no cast back to the original type).

So I think we should try some casting, but not in such a liberal way as done now (similar to the "soft conversion" in infer_objects, which will not cast integers to datetime64, but will cast datetime objects).
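
To illustrate that "soft conversion" distinction with infer_objects itself (a hedged aside, not the proposed groupby fix): datetime objects in an object column are picked up, while plain integers keep their dtype:

import pandas as pd
from datetime import datetime

# datetime objects stored as object dtype are soft-converted to datetime64
dt = pd.Series([datetime(2012, 1, 1), datetime(2012, 1, 3)], dtype=object)
print(dt.infer_objects().dtype)    # datetime64[ns]

# plain integers are never coerced to datetime64 by infer_objects
ints = pd.Series([2012, 2012], dtype=object)
print(ints.infer_objects().dtype)  # int64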

@jbrockmendel
Member Author

@jorisvandenbossche for non-UDFs the behavior in this PR (and master) is correct. For UDFs you have convinced me that the casting should not be done.

Does this accurately reflect your views?

@jorisvandenbossche
Member

Not really, in my last comment I said: "So I think we should try some casting, but not in such a liberal way as done now."

Now, "trying to cast" is always a Pandora's box (but we do that a lot in pandas in all kinds of cases); I am mainly noting that not doing it would give a regression in my example above (#29589 (comment)).

@jbrockmendel
Member Author

Locally, if I disable this casting exclusively for UDFs, both the df.groupby('a')['b'].agg(lambda x: x.iloc[0]) and df.groupby('a')['b'].agg(lambda x: x.iloc[0].year) work correctly.

"So I think we should try some casting, but not in such a liberal way as done now." is vague compared to "do this casting only for non-UDFs". Can you phrase your preferred approach more specifically?

@jorisvandenbossche
Member

Can you show the result of what you get locally?

@jbrockmendel
Member Author

Can you show the result of what you get locally?

After adding a cy=True kwarg to _try_cast and passing cy=False in the two _try_cast calls within _python_agg_general (all in core.groupby.groupby), which just disables the casting in question:

>>> df = pd.DataFrame({'a': [0, 0, 1, 1], "b": pd.date_range("2012-01-01", periods=4, tz='UTC')})

>>> df.groupby('a')['b'].agg(lambda x: x.iloc[0])
a
0   2012-01-01 00:00:00+00:00
1   2012-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

>>> df.groupby('a')['b'].agg(lambda x: x.iloc[0].year)
a
0    2012
1    2012
dtype: int64

Does this match the expected/correct output?

@jorisvandenbossche
Member

Not sure what we do differently; what I tried was to just comment out the full casting block in _try_cast (but I might be missing complications of this at other locations):

diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 280f1e88b..ab347b0bd 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -788,29 +788,29 @@ b  2""",
         else:
             dtype = obj.dtype
 
-        if not is_scalar(result):
-            if is_datetime64tz_dtype(dtype):
-                # GH 23683
-                # Prior results _may_ have been generated in UTC.
-                # Ensure we localize to UTC first before converting
-                # to the target timezone
-                arr = extract_array(obj)
-                try:
-                    result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
-                    result = result.astype(dtype)
-                except TypeError:
-                    # _try_cast was called at a point where the result
-                    # was already tz-aware
-                    pass
-            elif is_extension_array_dtype(dtype):
-                # The function can return something of any type, so check
-                # if the type is compatible with the calling EA.
-
-                # return the same type (Series) as our caller
-                cls = dtype.construct_array_type()
-                result = try_cast_to_ea(cls, result, dtype=dtype)
-            elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
-                result = maybe_downcast_to_dtype(result, dtype)
+        # if not is_scalar(result):
+        #     if is_datetime64tz_dtype(dtype):
+        #         # GH 23683
+        #         # Prior results _may_ have been generated in UTC.
+        #         # Ensure we localize to UTC first before converting
+        #         # to the target timezone
+        #         arr = extract_array(obj)
+        #         try:
+        #             result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
+        #             result = result.astype(dtype)
+        #         except TypeError:
+        #             # _try_cast was called at a point where the result
+        #             # was already tz-aware
+        #             pass
+        #     elif is_extension_array_dtype(dtype):
+        #         # The function can return something of any type, so check
+        #         # if the type is compatible with the calling EA.
+
+        #         # return the same type (Series) as our caller
+        #         cls = dtype.construct_array_type()
+        #         result = try_cast_to_ea(cls, result, dtype=dtype)
+        #     elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
+        #         result = maybe_downcast_to_dtype(result, dtype)
 
         return result

and then I get this (notice that the resulting dtype is not tz-aware here, in contrast with your example):

In [1]: df = pd.DataFrame({"a": [0, 0, 1, 1], 
   ...:                    "b": pd.date_range("2012-01-01", periods=4, tz='UTC')})

In [2]: df.groupby('a')['b'].agg(lambda x: x.iloc[0])
Out[2]: 
a
0   2012-01-01
1   2012-01-03
Name: b, dtype: datetime64[ns]

@jbrockmendel
Member Author

Not sure what we do differently; what I tried was to just comment out the full casting block

I just changed the if not is_datetime64tz_dtype(result.dtype): to if not is_datetime64tz_dtype(result.dtype) and not cy:

I'm now working on a new branch that should handle the cases you've mentioned correctly.

@jbrockmendel
Member Author

@jorisvandenbossche I'm working on adding the examples from here into #29641 and want to confirm the expected output. The results I posted here did not have a name for the Series; should the name be "b"?

@jbrockmendel
Member Author

Closing in favor of #29641

@jbrockmendel jbrockmendel deleted the faster-gb2 branch November 21, 2019 19:59
Labels: Clean, Datetime (Datetime data dtype)
3 participants